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Abstract: It is now widely accepted that knowledge can be acquired from 
networks by clustering their vertices according to connection profiles. Many 
methods have been proposed and in this paper we concentrate on the Stochastic 
Block Model (SBM). The clustering of vertices and the estimation of SBM 
model parameters have been subject to previous work and numerous inference 
strategies such as variational Expectation Maximization (EM) and classification 
EM have been proposed. However, SBM still suffers from a lack of criteria to 
estimate the number of components in the mixture. To our knowledge, only one 
model based criterion, ICL, has been derived for SBM in the literature. It relies 
on an asymptotic approximation of the Integrated Complete-data Likelihood 
and recent studies have shown that it tends to be too conservative in the case 
of small networks. To tackle this issue, we propose a new criterion that we call 
ILvb, based on a non asymptotic approximation of the marginal likelihood. We 
describe how the criterion can be computed through a variational Bayes EM 
algorithm. 

Key words: Random graphs, Stochastic block models, Community detec- 
tion, Variational EM, Variational Bayes EM, Integrated complete-data likeli- 
hood, Integrated observed-data likelihood 
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1 Introduction 



Networks are used in many scientific fields such as biology ( Albert and Barabasi 



20021 and social sciences (Snijders and Nowicki 1997 Nowicki and Snijders 



2001 1. They aim at modelling with edges the way objects of interest are related 



to each other. Examples of such data sets are friendship (Palla et al 2007) 



protein-protein interaction networks (Barabasi and Oltvai 



20041, powergrids 



(Watts and Strogatz 1998) and the Internet ( Zanghi et al|2008 ). In this context 



a lot of attention has been paid on developing models to learn knowledge from 
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the network topology. It appears that available methods can be grouped into 
three significant categories. 

Some models look for community structure, also called homophily or assor- 
tative mixing ( Girvan and Newman|200"2 Danon et al||2005 ). Given a network, 
the vertices are partitioned into classes such that vertices of a class are mostly 
connected to vertices of the same class. In the model of Handcock et al (2007), 
which extends Hoff et al ( 2002 1 , vertices are clustered depending on their posi- 
tions in a continuous latent space. They proposed a two-stage maximum likeli- 
hood approach and a Bayesian algorithm, as well as an asymptotic BIC criterion 
to estimate the number of latent classes. The two-stage maximum likelihood 
approach first maps the vertices in the latent space and then uses a mixture 
model to cluster the resulting positions. In practice, this procedure converges 
quickly but looses some information by not estimating the positions and the 
cluster model at the same time. Conversely, the Bayesian algorithm, based on 
Markov Chain Monte Carlo, estimates both the latent positions and the mixture 
model parameters simultaneously. It gives better results but is time consuming. 
Both the maximum likelihood and the Bayesian approach are implemented in 
the R package "latentnet" (Krivitsky and Handcock 2009). 

Other models look for disassortative mixing, in which vertices mostly con- 
nect to vertices of different classes (Estrada and Rodriguez- Velazquez 2005). 
They are particularly suitable for the analysis of bipartite networks which are 
used in numerous applications. Examples of data sets having such structures are 
transcriptional regulatory networks where operons encode transcription factors 
directly involved in operons regulation. To get some insight into the transcrip- 
tion process, these two types of nodes are often grouped into different classes 
with high inter connection probabilities. Other examples are citation networks 
where authors cite or are cited by papers. For a more detailed description 
of the differences between community structure and disassortative mixing, see 



Newman and Leicht (2007). 



Finally, a few models can look for both community structure and disassor- 
tative mixing. Hofman and Wiggins ( 2008 ) proposed a probabilistic framework, 
as well as an efficient clustering algorithm. Their model, implemented in the 
software "VBMOD", is based on two key parameters A and e. Given a net- 
work, it assumes that vertices connect with probability A if they belong to the 
same class and with probability e otherwise. Moreover, they introduced a non 
asymptotic Bayesian criterion to estimate the number of classes. It is based on 
a variational approximation of the marginal likelihood and has shown promising 
results. In this paper, we focus on the Stochastic Block Model (SBM) which was 



originally developed in social sciences (White et al 



man|[T98Tl |Frank and Harary|[l982l |Holland et "al 



1976 



1983 



Fienberg and Wasser- 



Snijders and Nowicki 



19971. Given a network, SBM assumes that each vertex belongs to a hidden 



class among Q classes, and uses a Q x Q matrix tv to describe the intra and 
inter connection probabilities. Moreover, the class proportions are represented 
using a Q-dimensional vector a. No assumption is made on the form of the con- 
nectivity matrix such that very different structures can be taken into account. 
In particular, SBM can characterize the presence of hubs which make networks 
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locally dense (Daudin et al 2008). Moreover and to some extent, it general- 
izes many of the existing graph clustering techniques as shown in 



Newman and 



Lcicht ( 2007 1 . For instance, the model of Hofman and Wiggins ( 2008 ) can be 



seen as a constrained SBM where the diagonal of 7r is set to A and all the other 
elements to e. 

Many methods have been proposed in the literature to jointly estimate SBM 
model parameters and cluster the vertices of a network. They all face the same 
difficulty. Indeed, contrary to many mixture models, the conditional distribu- 
tion of all the latent variables Z and model parameters, given the observed data 
X, can not be factorized due to conditional dependency (for more details, see 



Daudin et al 20081. Therefore, optimization techniques such as the EM algo- 



rithm can not be used directly. Nowicki and Snijders ( 2001 ) proposed a Bayesian 
probabilistic approach. They introduced some prior Dirichlet distributions for 
the model parameters and used Gibbs sampling to approximate the posterior 
distribution over the model parameters and posterior predictive distribution. 
Their algorithm is implemented in the software BLOCKS, which is part of the 
package StoCNET ( Boer et al|2006 ). It gives accurate a posteriori estimates but 



can not handle networks with more than 200 vertices. Daudin et al (20081 pro- 
posed a frequentist variational EM approach for SBM which can handle much 
larger networks. Online strategies have also been developed ( Zanghi et al|2008 ). 

While many inference strategies have been proposed for estimation and clus- 
tering purpose, SBM still suffers from a lack of criteria to estimate the number 
of classes in networks. Indeed, many criteria, such as the Bayesian Information 



Criterion (BIC) or the Akaike Information Criterion (AIC) (Burnham and An- 



derson 2004) are based on the likelihood p(X|a,7r) of the observed data X, 
which is intractable here. To tackle this issue, Mariadassou et al (2010) and 



Daudin et al (2008) used a criterion, so-called ICL, based on an asymptotic 
approximation of the integrated complete-data likelihood. This criterion relies 
on the joint distribution p(X, Z | a, 7r) rather than p(X | a, 7r) and can be easily 
computed, even in the case of SBM. ICL was originally proposed by Biernacki 



et al (20001 for model selection in Gaussian mixture models, and is known to 



be particularly suitable for cluster analysis view since it favors well separated 



clusters. However, because it relies on an asymptotic approximation, Biernacki 



et al (2010) showed, in the case of mixtures of multivariate multinomial dis- 



tributions, that it may fail to detect interesting structures present in the data, 
for small sample sizes. Mariadassou et al (20101 obtained similar results when 



analyzing networks generated using SBM. They found that this asymptotic cri- 
terion tends to underestimate the number of classes when dealing with small 
networks. We emphasize that, to our knowledge, ICL is currently the only model 
based criterion developed for SBM. 

Our main concern in this paper is to propose a new criterion for SBM, based 
on the marginal likelihood p(X), also called integrated observed-data likelihood. 
The marginal likelihood is known to focus on density estimation view and is 
expected to provide a consistent estimation of the distribution of the data. For 
a more detailed overview of the differences between integrated complete-data 
likelihood and integrated observed-data likelihood, we refer to |Biernacki et al| 
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(2010). In the case of SBM, the marginal likelihood is not tractable and we 
describe in this paper how a non asymptotic approximation can be obtained 
through a variational Bayes EM algorithm. 

In Section [2j we describe SBM and we introduce some non informative con- 
jugate prior distributions for the model parameters. The variational Bayes EM 
algorithm is then presented in Section [3| We show in Section [4] how it nat- 
urally leads to a new model selection criterion that we call ILvb, based on a 
non asymptotic approximation of the marginal likelihood. Finally, in Section 
[5j we carry out some experiments using simulated data sets and the metabolic 
network of Escherichia coli, to assess ILvb. 

The R package "mixer" implementing this work is available from the following 
web site: http://crcOi.r-project.org. 



2 A Mixture Model for Graphs 

The data we model consists of a N x N binary matrix X, with entries Xij 
describing the presence or absence of an edge from vertex i to vertex j. Both 
directed and undirected relations can be analyzed but in the following, we focus 
on undirected relations. Therefore X is symmetric. 



2.1 Model and Notations 



The Stochastic Block Model (SBM) introduced by |Nowicki and Snijders| ( |2001 1 
associates to each vertex of a network a latent variable Zj drawn from a multi- 
nomial distribution, such that Zi q = 1 if vertex i belongs to class q 

Zj ~ M. 1 1, a — (ati, oc-2, . . . , a 



Q. 

We denote a, the vector of class proportions. The edges are then drawn from 
Bernoulli distribution 



Xij\{Z iq Zjl = 1} ~ B{-K q l), 

where 7r is a Q x Q matrix of connection probabilities. According to this model, 
the latent variables Zi, . . . , Z^r are iid and given this latent structure, all the 
edges are supposed to be independent. Note that SBM was originally described 
in a more general setting, allowing any discrete relational data. However, as 
explained previously, we concentrate in the following on binary edges only. 



4 



Thus, when considering an undirected graph without self loops, this leads to 



iV 



AT Q 



P (z\ a )=nM(z i; i,a)=nn 



= lq=l 



and 



P (X\Z,tt) = ]Jp(X ij \Z i ,Z j ,Tr) 



i<3 



i<3 q,l 

=nn(^( i -^ ) 

«<i q,l 

In the case of a directed graph, the products over i < j must be replaced by 
products over i ^ j. The edges Xu must also be taken into account if the graph 
contains self-loops. 



Note that SBM is related to the infinite block model of Kemp et al ( 2004 1 
although the number Q of classes is fixed. Moreover, contrary to the mixed 



membership stochastic block model of Airoldi et al ( 2008 ) which captures partial 



membership and allows each vertex to have a distribution over a set of classes, 
SBM assumes that each vertex of a network belongs to a single class. 



The identifiability of SBM was studied by Allman et al (2009), who showed 
that the model is generically identifiable up to a permutation of the classes. In 
other words, except in a set of parameters which has a null Lebesgue's measure, 
two parameters imply the same random graph model if and only if they differ 
only by the ordering of the classes. 



2.2 A Bayesian Stochastic Block Model 



SBM can be described in a full Bayesian framework where it can be considered 



as a generalisation of the affiliation model proposed by Hofman and Wiggins 



(2008). Indeed, the Bayesian model of Hofman and Wiggins (2008) considers a 



simple structure where vertices of the same class connect with probability A and 
with probability e otherwise. Therefore, it can be seen as a constrained SBM 
where the diagonal of 7r is set to A and all the other elements to e. 

To extend the SBM frcqucntist model, we first specify some non informative 
conjugate priors for the model parameters. Since p(Z$ \a) is a multinomial 
distribution, we consider a Dirichlet distribution for the mixing coefficients 



P 



(a|n° = {<..., n£}) = Dir(a; n°), 



(2.1) 



where n® 



1/2, Vg. This Dirichlet distribution corresponds to a non-informative 
Jeffreys prior distribution which is known to be proper ( Jeffreys|1946 ) . It is also 
possible to consider a uniform distribution on the Q—1 dimensional simplex by 
fixing n Q q = 1, Vq. 
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Since p(Xij\ Z,;, Zj,7r) is a Bernoulli distribution, we use independent Beta 
priors to model the connectivity matrix 7T 

q<l 

with = (J = 1/2, Vg. This corresponds to a product of non-informative 
Jeffreys prior distributions. Note that if the graph is directed, the products over 
q <l, must be replaced by products over q,l since ir is no longer symmetric. 

Thus, the model parameters are now seen as random variables (see Figure [I]) 
whose distributions depend on the hyperparameters n°, rj°, and £°. In the fol- 
lowing, since these hyperparameters are fixed and in order to keep the notations 
simple, they will not be shown explicitly in the conditional distributions. 




v J 



Figure 1 Directed acyclic graph representing the Bayesian view of the stochas- 
tic block model. Nodes represent random variables, which are shaded when they 
are observed and edges represent conditional dependencies. 



3 Estimation 

In this section, we first describe the variational EM algorithm used by |Daudin 



et al (20081 to jointly estimate SBM model parameters and cluster the vertices 



of a network. We then propose a new variational Bayes EM algorithm for SBM 
which approximates the full posterior distribution of the model parameters and 
latent variables, given the observed data X. This procedure relies on a lower 
bound which will be later used, in Section |4j as a non asymptotic approximation 
of the marginal log-likelihood lnp(X). 
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3.1 Variational Approach 

The likelihood p(X|a,7r) of the observed data X can be obtained through 
the marginalization p(X | a, it) = p(X, Z | a, ir). This summation involves 
Q N terms and quickly becomes intractable. To tackle such problem, the well 
known EM algorithm ( Dempster et al|1977 McLachlan and Krishnan|1997 1 has 
been applied with success on a large variety of mixture models. This two stage 
estimation approach (Hathaway 1986 Neal and Hinton||1998 ) can be described 
in a variational inference framework. Thus, given a distribution q(Z) over the 
latent variables, the log-likelihood of the observed data is decomposed into two 
terms 

lnp(X | a, tt) =£((?(.); a,7r) +KL (q(.) \\p(.\ X,a,7r)), (3.1) 



where 



and 



KL 



C(q(.); a, 
(q(.)\\p(.\X, a ,TT) 



z 



g(Z)ln{ 



p(X,Z| 



9(Z) 



z 



?(z)M 



p(Z|X,a,7r) 



}■ 



(3.2) 



(3.3) 



In (3.1| and (3.3), KL denotes the Kullback-Leibler divergence between the 
distribution q{Z) and the distribution p(Z | X, a, 7r). Suppose that the current 
value of the model parameters is (a° ld , ir old ) . During the E-step, the lower 
bound £^q(.); a° ld ,ir old ^j is maximized with respect to q(Z) while holding the 
model parameters fixed. The solution to this optimization step occurs when the 
KL divergence vanishes, that is when q(Z) is equal to p(Z | X, ot old , ir old ). The 
lower bound is then equal to the log-likelihood of the observed data. In the 
M-step, the distribution q(Z) is held fixed and the lower bound is maximized 
with respect to the model parameters to give (ot new , ir new ). This causes the 
log-likelihood to increase. 

Unfortunately, when considering SBM, p(Z|X, a,7r) is not tractable and 
variational approximations are required. It can be easily verified that minimiz- 



ing (3.3) with respect to ?(Z) is equivalent to maximizing the lower bound (3.2 1 



of (3.1) with respect to q{Z). To obtain a tractable algorithm, Daudin et al 



(2008) assumed that the distribution q(Z) can be factorized such that 



iV 



JY 



q(Z) = Y[q(Z l ) = l[M(Z l - 1 l, Ti ), 



where Ti q is a variational parameter denoting the probability of node i to belong 
to class q. This gives rise to a so-called variational EM procedure. During the 



variational E-step, the model parameters are fixed and, by maximizing (3.2) 



with respect to q(Z), the algorithm looks for an approximation of the conditional 
distribution of the latent variables. Conversely, during the variational M-step, 
the approximation q{Z) is fixed and the lower bound is maximized with respect 
to the model parameters. This procedure is repeated until convergence and was 



proposed by Daudin et al (2008) for the SBM model. 
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3.2 Variational Bayes EM 

In the context of mixture models, the conditional distribution p(Z|X, Q!,7r) 
can generally be computed and therefore Bayesian inference strategies focus on 
estimating the posterior distribution p(a, tt | X). The distribution p(Z, a, tt | X) 
is then simply given by a byproduct. However, when considering SBM, the 
distribution p(Z | X, a., tt) is intractable and so we propose to approximate the 
full distribution p(Z, a, tt | X). We follow the work of Attias ( 1999 ), Corduneanu 



and Bishop (2001), Svensen and Bishop (2004) on Bayesian mixture modelling 



and Bayesian model selection. Thus, the marginal log-likelihood, also called 
integrated observed-data log-likelihood, can be decomposed into two terms 



lnp(X)=£ U.) +KL (</(.) || p(.|X 



where 



and 



(3.4) 



C {l(-))=J2 J J g(Z, a, tt) ln{ ] '^ z Z a a ^ ) }d a dTT, (3.5) 



KL b(-|X) 



E 
z 



/ / q Z,a,7r ln{ n 1' ' / }dad7T. 
J J 9(Z,a,7r) 



(3.6) 



Again, as for the variational EM approach (Section 3.1), minimizing (3.6) with 



respect to g(Z,a,7r) is equivalent to maximizing the lower bound (3.5) of (3.4) 
with respect to <?(Z, a, tt). However, we now have a full variational optimization 
problem since the model parameters are random variables and we are looking for 
an approximation g(Z, a, tt) of p(Z, a, tt \ X). To obtain a tractable algorithm, 
we assume that the distribution q(Z, a, tt) can be factorized such that 

N 

q(Z, a, tt) = q{a)q(TT)q(Z) - g(a)?(7r) J[ <z(Z,). 



In the following, we use a variational Bayes EM algorithm. We call variational 
Bayes E-step, the optimization of each distribution q(Zi) and variational Bayes 
M-step, the approximations of the remaining distributions q(a) and (/(tt). All 
the optimization equations, the lower bound, as well as proofs are given in the 
appendix. 

We first initialize a matrix T old with a hierarchical algorithm based on the 
classical Ward distance. The distance between vertices which is considered 
is simply the Euclidean distance d(i,j) — J2k=i(Xik ~ ATjfc) 2 which takes the 
number of discordances between i and j into account. Given a number of classes 
Q, each vertex is assigned (hard assignment) to its nearest group. Second, the 
algorithm uses (B.l ) and (C.l ) to estimate the variational distributions over the 



model parameters a as well as tt. Finally, the variational distribution over the 
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latent variables is estimated using (A.l). The algorithm cycles though the E 
and M steps until the absolute distance between two successive values of the 
lower bound (D.l) is smaller than a threshold eps. In the experiment section, 
we set eps = le — 6. In practice, smaller values slow the convergence of the 
algorithm and do not lead to better estimates. 

The computational costs of the frequentist approach of jDaudin et al (20081 
and our variational Bayes algorithm are both equal to 0(Q 2 N' 2 ). Analyzing a 
sparse network takes about a second for N — 200 nodes and about a minute for 
N = 1000. 



4 Model selection 



So far, we have seen that the variational Bayes EM algorithm leads to an ap- 
proximation of the posterior distribution of all the model parameters and latent 
variables, given the observed data. However, the problem of estimating the 
number Q of classes in the mixture has not been addressed yet. Given a set of 
values of Q, we aim at selecting Q* which maximizes the marginal log-likelihood 
lnp(X|Q), also called integrated observed-data log-likelihood. The marginal 
likelihood is known to focus on density estimation view and is expected to pro- 



vide a consistent estimation of the distribution of the data (Biernacki et al 



20101. Unfortunately, this quantity is not tractable, since for each value of Q, 



it involves integrating over all the model parameters and latent variables 



lnp(X|Q) 




p(X, Z, a, 7r \Q) dad-K 



To tackle this issue, we propose to replace the marginal log-likelihood with its 
variational Bayes approximation. Thus, given a value of Q, the algorithm in- 



troduced in Section 3.2 is used to maximize the lower bound (3.5 1 with respect 
to q(-). We recall that this maximization implies a minimization of the KL 
divergence (3.6) between q(.) and the unknown posterior distribution. After 



convergence of the algorithm, according to (3.4 1, if the KL divergence is small 



then the lower bound C\q(.)^ approximates the marginal log-likelihood. Ob- 
viously, this assumption can not be verified in practice since (3.6) can not be 



computed analytically. Moreover, we emphasize that there is no solid reason 
to believe that the KL divergence is close to zero and does not depend on the 
model complexity. Nevertheless, in order to obtain a tractable model selection 
criterion we rely on this approximation. After convergence of the algorithm, the 
lower bound takes a simple form and leads to a new criterion for SBM that we 
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call ILvb 



. t ^M rE r° ! "f- %) ) 

r(E?. 1 »,)n?.,r(»S) 

where Ti q is the estimated probability of vertex i to belong to class q and {n q ) q , 
(Vgi)qh (Cqi)qi are parameters given in the appendix. The gamma function is 



denoted by F(-). Contrary to the criterion proposed by Daudin et al (2008), 
ILvb does not rely on an asymptotic approximation, sometimes called BIC- 
like approximation. In practice, given a network, the variational Bayes EM 
algorithm is run for the different values of Q considered and Q* is chosen such 
that ILvb is maximized. 



5 Experiments 



We present some results of the experiments we carried out to assess the criterion 
we proposed in Section |4| Throughout our experiments, we chose to compare 
our approach to the work of Daudin et al (2008) and Hofman and Wiggins 



(2008). Indeed, contrary to many other model based techniques, the correspond- 



ing algorithms can analyze networks with hundred of nodes in a reasonable 
amount of time (a few minutes on a dual core). We recall that Daudin et al 



(2008) proposed a frequentist maximum likelihood approach (see Section 3.1| 
for SBM as well as an ICL criterion. On the other hand, |Hofman and Wiggins) 
(2008) presented a model for community structure detection and a Bayesian 



criterion that we will denote VBMOD. Thus, by using both synthetic data and 
the metabolic network of bacteria Escherichia coli, our aim is twofold. First, 
we illustrate the overall capacity of SBM to retrieve interesting structures in a 
large variety of networks. Second, we concentrate on comparing the two criteria 
ICL and ILvb developed for SBM. 



5.1 Comparison of the criteria 



In these experiments, we consider two types of networks. In Section 5.1.1 



we 



generate affiliation networks, made of community structures, using the genera- 



tive model of Hofman and Wiggins ( 2008 ). Therefore, vertices of the same class 
connect with probability A and with probability e otherwise. This corresponds 
to a constrained SBM where the diagonal of the connectivity matrix is set to A 
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and all the other elements to e 



/A e 
e A 

\e ... 



e A/ 



In Section [5.1.2[ we then draw networks with more complex topologies, made 
of both community structures and a class of hubs. The corresponding model is 
given by the connectivity matrix 



7T 



f\ 6 
£ A 

Va ... 



e A\ 



V 



where hubs connect with probability A to any vertices in the network. 



Following Mariadassou et al (20101 who showed that ICL tends to under- 



estimate the number of classes in the case of small graphs, we consider net- 
works with only TV = 50 vertices to analyze the robustness of our criterion. 
We set (A = 0.9, e = 0.1) and for each value of Qrrue m the set {3, . . . , 7}, 
we then generate 100 networks with classes mixed in the same proportions 

a l = ■ ■ ■ = a Q T ru C = l/QTrue- 

In order to estimate the number of classes in the latent structures, we applied 
the methods of |Hofman and Wiggins (2008), Daudin et al| (2008), and our 



algorithm (Section 3.2) on each network, for various numbers of classes Q S 
{1, . . . , 7}. Note that, we choose n° q = 1/2, Vq e {1, . . . , Q} for the Dirichlet 
prior and i] ql — C°; = 1/2, V(g, I) G {1, . . . , Q} 2 for the Beta priors. We recall 
that such distributions correspond to non informative prior distributions. Like 
any optimization technique, the clustering methods we consider depend on the 
initialization. Thus, for each simulated network and each number of classes Q, 
we use five different initializations of t. Finally, we select the best learnt models 
for which the corresponding criteria VBMOD, ICL, or ILvb were maximized. 

Before comparing ICL and ILvb, it is crucial to recall that these two crite- 
ria were not conceived for the same purpose. ICL approximates the integrated 
complete-data likelihood and is known to focus on cluster analysis view since it 
favors well separated clusters. It realizes a compromise between the estimation 
of the data density and the evidence of data partitioning. Conversely, ILvb 
approximates the marginal likelihood which is known to focus on density esti- 
mation only. In the following experiments, since networks are generated using 
SBM, and because we evaluate the criteria through their capacity to retrieve 
the true number of classes, ILvb is expected to lead to better results. However, 
in other situations (which are not considered in this paper), where the focus 
would be on the clustering of vertices, ICL might be of possible interest. 
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5.1.1 Affiliation Networks 

In Table [TJ we observe that VBMOD outperforms both ICL and ILvb. For 
instance, when Qrrue — 5, VBMOD correctly estimates the number of classes 
of the 100 generated networks, while ICL and ILvb have respectively a percent- 
age of accuracy of 77 and 99. These differences increase when Qrrue — 6 and 
Qrrue = 7. Indeed, the higher Qrrue is, the less vertices the classes contain, 
and therefore, the more difficult it is to retrieve and distinguish the commu- 
nity structures. Thus, when Qrrue = 7, each class only contains on average 
Qrrue/N ~ 7.1 vertices. VBMOD appears to be a very stable criterion for 
community structure detection. It has a percentage of accuracy of 84 while ICL 
never estimates the true number of classes. 



All the affiliation networks were generated using the model of Hofman and 



Wiggins (20081 which explains the results of VBMOD presented above. Indeed, 
the corresponding model for community structure detection only estimates the 
parameters A and e whereas the frequentist and Bayesian approaches for SBM 
look for a full Q x Q matrix 7r of connection probabilities. They are capable of 
handling networks with complex topologies, as shown in the following section, 
but they might miss some structures if the number of vertices is too limited. 

We observe that ILvb leads to a better estimates of the true number of 
classes in networks than ICL. Thus, when Qrrue = 5 an d Qrrue = 6, ILvb 
estimates correctly the number of classes of 99 and 73 networks while ICL has 
respectively a percentage of accuracy of 77 and 12. 

5.1.2 Networks with Community Structures and Hubs 

Table [2] displays the results of the experiments on networks exhibiting commu- 
nity structures and hubs. The presence of hubs is a central property of so-called 
real real networks ( Albert and Barabasi 2002 ) . 



This slightly more complex and more realistic situation does heavily perturb 
the estimation of VBMOD. Most of the time, VBMOD fails to detect the class 
of hub and henceforth underestimates the number of classes. For example, when 
Qrrue = 3 or Qrrue = 4, VBMOD always misses a class. When the number of 
true classes grows over four, VBMOD's behaviour becomes more variable but 
keep the same heavy tendency to underestimate. 

In this context, ICL and ILvb behaves more consistently than VBMOD. 
When Qrrue is less or equal than four both strategies are comparable. But 
when the number of true classes increases, the performance of ICL dramatically 
deteriorates, whereas ILvb remains more stable. 

In the context of small graph, when the focus is on the estimation of the 
data density, ILvb clearly provides a more reliable estimation of the number of 
class than ICL. It also shows better performances that VBMOD when networks 
arc made of classes which are not communities. 
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Table 1 Confusion matrices for VBMOD, ICL and ILvb. A = 0.9, e = 0.1 
and Qrrue € {3, . . . , 7}. Affiliation networks. 
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(c) 


QTrue\QlLvb 







5.2 The metabolic network of Escherichia coli 



We apply the methodology described in this paper to the metabolic network of 



bacteria Escherichia coli ( Lacroix et al 2006 ) which was analyzed by Daudin 



et al (2008) using SBM. In this network, there are 605 vertices which rep- 
resent chemical reactions and a total number of 1782 edges. Two reactions 
are connected if a compound produced by the first one is a part of the sec- 
ond one (or vice- versa). As in the previous section, we consider non informa- 
tive priors: we fixed n° q = 1/2, Vq 6 {1, . . . , Q} for the Dirichlct prior and 
C = 1/2, V(<?, € {1, . . . , Q} 2 for the Beta priors. 



Thus, for Q e {1, . . . ,40}, we apply the methods of Hofman and Wiggins 



(2008) as well as our approach on this network. We compute the corresponding 



criteria and we repeat such procedure 60 times, for different initializations of 
t. Indeed, to speed up the initialization, we first run a kmeans algorithm with 
40 classes and random initial centers. We then use the corresponding partitions 
as inputs of the hierarchical algorithm described in Section |3.2| The results for 
ILvb are presented as boxplots in Figure [2] The criterion finds its maximum for 
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Table 2 Confusion matrices for VBMOD, ICL and ILvb. A = 0.9, e = 0.1 
and Qrrue € {3, . . . , 7}. Affiliation networks and a class of hubs. 
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(c) QTrue\QlLvb 



QiLvb = 22 classes, while Daudin et al (2008) found Qicl = 21. Thus, for this 
particular large data set, both ILvb and ICL lead to almost the same estimates 
of the number of latent classes. 

We also compared the learnt partitions in the Bayesian and in the frequcn- 
tist approach. Figure [3] is a dot plot representation of the metabolic network 
after having applied the Bayesian algorithm for Qvb = 22. Each vertex i is 
classified into the class for which r^ g is maximal (Maximum A Posteriori esti- 
mate). We observed very similar patterns in the frequentist approach. Among 
the classes, eight of them are cliques Tr qq = 1 and six have within probability 
connectivity greater than 0.5. As shown by Daudin et al (2008), these cliques 
or pseudo-cliques gather reactions involving a same compound. Thus, choris- 
mate, pyruvate, L-aspartate, L-glutamate, D-glyceraldehyde-3-phosphate and 
ATP are all responsible for cliques. Moreover, as observed in |Daudin et al| 
(2008), since the connection probability between class 1 and 17 is 1, they corre- 
spond to a single clique which is associated to pyruvate. However that clique is 
split into two sub-cliques because of their different connectivities with reactions 
of classes 7 and 10. Since the approach of Hofman and Wiggins (2008) only 
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Figure 2 Boxplot representation (over 60 experiments) of ILvb for Q € 
{!,..., 40}. The maximum is reached at QiL vb = 22. 



looks for community structures, it can not retrieve such complex topologies, as 



shown in Section 5.1.2 and many classes such as class 1 and 17 were merged. 
We found Qvbmod = 14. 



6 Conclusion 

In this paper, we showed how the Stochastic Block Model (SBM) could be 
described in a full Bayesian framework. We introduced some non informative 
conjugate priors over the model parameters and we described a variational Bayes 
EM algorithm which approximates the posterior distribution of all the latent 
variables and model parameters, given the observed data. Using this framework, 
we derived a non asymptotic model selection criterion, so-called ILvb, which ap- 
proximates the marginal likelihood. By considering networks generated using 
SBM, we showed that ILvb focus on the estimation of the data density and 
provides a relevant estimation of the number of latent classes. We also illus- 
trated the capacity of SBM to retrieve interesting structures in a large variety 
of networks, contrary to algorithms looking for community structures only. In 
future work, we will investigate approximate Bayesian computation methods for 
model selection. These simulation techniques seem particularly promising for 
the analysis of SBM where the likelihood of the observed data is intractable. 
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Figure 3 Dot plot representation of the metabolic network after classification 
of the vertices into Qvb = 22 classes. The x-axis and y-axis correspond to the 
list of vertices in the network, from 1 to 605. Edges between pairs of vertices 
are represented by shaded dots. 
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Appendix 

A Approximation of g(Zj) the conditional distri- 
bution of the latent variables 

The optimal approximation at vertex i is 

q(Zi)=M(Zi; l,r i = {T il ,...,T iQ }), (A.l) 

where Ti q is the probability (responsability) of node i to belong to class q. It 
satisfies the relation 

N Q Tji \-4'(C, q i)-i)(ri q i+Q q i)+X t Aip(v q i)- 0(d)) 

jjti 1=1 

where tp{.) is the digamma function. In order to optimize the distribution q(Z), 
we rely on a fixed point algorithm. Thus, given a matrix T old , the algorithm 
builds a new matrix r new where each rows satisfies (A.2). After normalization, 
it then uses r new to build a new matrix and so on. The algorithm stops when 
YliLi J2^=i \ T iq d ~ T iq CW \ < e P s - I n the experiment section, we set eps — le — 6. 

Proof: According to variational Bayes, the optimal distribution q(Zi) is 
given by 

lnQ(Zi) = E z v a7r [lnp(X,Z,a,7r)] + est 

= E z \ %7r [lnp(X | Z, tt)] + E z v a [lnp(Z |a)] + est 

= E zv Z V qZji[X Vj ln7r 9 j + (1 - Ajvj) ln(l - tt,,))] 

i'<j 9.' 

JV Q 

+ E Z\',«[X!5Z Zi '9 lna 9] +CSt 
i' = l g =l 

Q / N Q 

= ^ Z iq I E Q? [lna 9 ] + T i ( ( E ^ [ ln7r 9'] ~ E ^ [ ln ( 1 ~ 

9=1 V i¥« i=l 

+ E ff9! [ln(l-7r gi )])^ +cst 

Q / N N Q 

9=1 V i=l 3=fi 1=1 

+ HCql) ~ ^(Vql + Cql))^ + CSt, 

(A.3) 
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where Z^ 1 denotes the class of all nodes except node i. We have used E a [lny] = 
ip(a)—tp(a+b) when y ~ Beta(y; a, b). Moreover, to simplify the calculations, the 
terms that do not depend on Zj have been absorbed into the constant. Taking 
the exponential of (IA.3I) and after normalization, we obtain the multinomial 



distribution (A.l I 



B Optimization of q{ot). 

The optimization of the lower bound with respect to q{a) produces a distribu- 
tion with the same functional form as the prior p(a) 

q{a) = Dir(a; n), (B.l) 

where 

N 

n q = n°g+J2 r n- ( B - 2 ) 

i=l 

Proof: According to variational Bayes, the optimal distribution q(a) is 
given by 

In q(at) = Ez.k [hip(X, Z, ex, n)] + est 

= E z [lnp(Z | a.)} + lnp(a) + est 

N Q Q 

= ^^T i9 ma g + ^(n°-l)ma: 9 +cst ( B3 ) 



i — 1 q— 1 9— 1 

Q / W \ 

9=1 V f=l / 



a q + est. 



distribution (B.l I 



Taking the exponential of (B.3) and after normalization, we obtain the Dirichlet 



C Optimization of q(n). 



Again, the functional form of the prior p(ir) is conserved through the variational 
optimization: 

Q 

q(n) = JjBeta(7r gi ; r) q i,( q i), (C.l) 

q<l 

For q ^ I, the hyperparameter r\ q i is given by 

N 

Vql = Vql + X V T iq T 3U ( C ' 2 ) 

and Vg: 

N 

Vqq =Vqq+Yl XijTigTjg. (C.3) 

i<j 
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Moreover, for q ^ I, the hyperparameter is given by 

JV 

and Vg: 

AT 

C M = & + £(1-*«)tW ( C ' 5 ) 

i<j 

Proof : According to variational Bayes, the optimal distribution g(7r) is 
given by 

lnq(7r) = Ez )C *[hip(X, Z, a, 7r)] + est 

= E z [lnp(X|Z,7r)] + rnp(-7r) + est 

AT Q 

= E E Ti <i r i i { x ^ ln n i i + ( 1 ~ x «) m* ~ ^o) 

*<i 1,1 

Q 



E (K< - !) ln7r ?< + - !) Ml - ^)) + est 

q<l 

Q N 

E E Ti ? r ^ ln ^ + ( x ~ M 1 ~ ^o) 

q<l ijij 

Q N 

+ E E T iQ T 3Q\ X ij ln7r ?g + C 1 _ M 1 ~ ^gg)) 

9=1 *<j 

Q 

+ E (« ^ l ) ln7r ?' + C$ - !) Ml ~ ^)) + cst 

Q / N N \ 

= E - 1 + E + (Cg° - 1 + E - x ^) mi - 

g<i \ i^j / 

Q I N N \ 

+ E ( (»& - 1 + E T <i T i9 X v) ln7r ^ + (Cgg - 1 + E T ^0- - X ^) M* - *n) 

9=1 V i<j i<7 > 

(C.6) 



of Beta distribution ( C.l ) 



Taking the exponential of (C.6) and after normalization, we obtain the product 



D Lower bound. 

The lower bound takes a simple form after the variational Bayes M-step. Indeed, 
it only depends on the posterior probabilities Tj g as well as the normalizing 



21 



constants of the Dirichlet and Beta distributions 

ff n \ . r r(E^<)n^rK) Q f r( T? g l + <g I )rfa t )r(c„) 

£ ^ = ln{ r(ES^)nS^ }+ S 

(D.l) 

Proof : The lower bound is given by 

= Ez !Ct!7r [lnp(X, Z,a,n)] - E ZiCt!7r [lnq(Z, a, 7r)] 

= Ez !7r [lnp(X | Z,tt)] + E z , a [lnp(Z I a)] + E Q [lnp(a)] + E 7r [lnp(7r)] 

AT 



^ EzJln^Z,)] - E a png(a)] - E^ln^Tr)] 

i=l 

N Q Q Q Q 

+ E E T ^ (V K) - ^(E + ln r GC - E ln r K) 

2—1 g=l I — 1 q— 1 g— 1 

+ E (»S - x ) (v-K) - ^(E »')) + E f lnr « + 0) 

9=1 Z=l 9<( V 

- inr($) - inr(c 9 ° ; ) + ( V ° q i - i) (VM - ^ + CO) 

\ N Q 

+ (C° q i i) (v>(C 9 /) - 1>(v Q i + C,i)) - E E T ^ lnr ^ 

/ i=l 9=1 

Q Q Q Q 

- lnr(^ n,) + ^ lnl> 9 ) - £ (n q l) (V(n g ) - V(E »0j 

9=1 9=1 9=1 Z=l 

Q / 

£ inrfa,, + c 9 - inr(^) - lnr(Q) 

q<l \ 

+ 



i=l 

AT Q 



(fyj - !) (^(^) - ^(%J + C9O) + (Ql - 1) (V'(Cgi) - ^(Vql + Cql)) 

(D.2) 
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Therefore 

Q / N 



q<l \ ijij 

N \ 

+ ($ - c 9 * + X - x v)) (^(Cfl») - ^ + c,o) 

Q / AT 

+ S I (^M ~ Vqq + X T iq T Jq X ij) (V>(W ~ <P(Vqq + (qqi) 
9=1 \ »<j 

AT \ 

+ (C ~ C99 + X T "> T ^ 1 - *«)) - + C OT )) 

Q JV Q 

+E(»2-»*+E Ti ~ n,) ) 

g=l i=l i=l 

AT Q 

-EE 7 -? ln7 v 

i = l g=l 

(D.3) 

After the variational Bayes M-step, most of the terms in the lower bound vanish 
since 

• Mq : n q = n a q +J2iLi T i q - 

• Vq i= I : r] q i = jf ql + £gy ^j^^,, 

• Vg : r/ gg = rf qq + E^- XijT iq T jq . 

. Vg^ / : Q = C° ql + E^(l - A%-)t 19 t j7 , 

• V 9 : Cgg = Cqq + E^jC 1 ~ X ii)n q Tj q . 

Only the terms depending on the probabilities r i? and the normalizing constants 
of the Dirichlet and Beta distributions remain. 
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